Abstract

Suppose a box contains m balls, numbered from 1 to m. A random number of balls
are drawn from the box, their numbers are noted, and the balls are then returned to the
box. This is done repeatedly, with the sample sizes being iid. Let X be the number of
samples needed to see all the balls. This paper derives a simple but typically very accurate
approximation for EX in terms of the sample size distribution. The justification of the
approximation formula uses Wald’s identity and Markov-chain coupling.


1.  Introduction 


Suppose we have a box containing m identical white balls. Let $K_1, K_2, \ldots$ be iid
random variables taking positive integer values. We randomly sample $K_1 \wedge m$ balls without
replacement, paint the sampled balls red, and return them to the box. Then $K_2 \wedge m$ balls
are sampled, painted red, and returned to the box, etc. Let X be the number of samples
needed to paint all the balls red. When $\max_{i \le X} K_i$ is with high probability small compared
to m, a good approximation for EX is given by


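(1.1)    $\dfrac{m \sum_{i=1}^{m} \frac{1}{i}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}} + \dfrac{\sum_{r=1}^{m-1} \frac{m}{m-r} P\{K > r\} \sum_{j=1}^{r} \frac{m}{m-j+1}}{\left[ \sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\} \right]^{2}}.$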


(A K without a subscript represents a generic $K_i$.) For instance, if m = 10 and the
$(K_i - 1)$’s are binomial (4, 1/2), then the true value of EX is 8.8937, while (1.1) gives
8.8933. For m = 20 and $(K_i - 1)$’s which are binomial (4, 1/2), the values are EX =
22.90753529760067 and (1.1) = 22.90753529760074. For m = 10 and $K_i$’s which are
uniformly distributed on {1, 2, 3, 4, 5}, EX = 8.74239 and (1.1) = 8.74236. For m = 20,
the values are EX = 22.740208948996 and (1.1) = 22.740208948981. (The true values
were computed by Jacek Dmochowski using exact recursive formulas suggested by Larry
Shepp.) It is difficult to determine the size of the approximation error when m is much
larger than 20. Even with double-precision computation, the round-off error seems to
dominate the true approximation error.
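
The approximation (1.1) is easy to evaluate numerically from the tail probabilities P{K > i}. The following Python sketch is only an illustration; the function names and the dictionary representation of the sample size distribution are assumptions made here and are not the code used for the values quoted above.

```python
from math import comb

def tail(pmf, i):
    """P{K > i} for a sample size distribution given as {k: P{K = k}}."""
    return sum(p for k, p in pmf.items() if k > i)

def approx_ex(pmf, m):
    """Evaluate the two-term approximation (1.1) for EX."""
    e_tau = sum(m / i for i in range(1, m + 1))        # m * (1 + 1/2 + ... + 1/m)
    e_d1 = sum(m / (m - i) * tail(pmf, i) for i in range(m))
    numer, inner = 0.0, 0.0
    for r in range(1, m):
        inner += m / (m - r + 1)                       # running sum over j = 1..r of m/(m-j+1)
        numer += m / (m - r) * tail(pmf, r) * inner
    return e_tau / e_d1 + numer / e_d1 ** 2

# K - 1 binomial(4, 1/2), and K uniform on {1, ..., 5}
binom_pmf = {k + 1: comb(4, k) / 16 for k in range(5)}
unif_pmf = {k: 0.2 for k in range(1, 6)}

print(approx_ex(binom_pmf, 10), approx_ex(unif_pmf, 10))
```

For m = 10 the two printed values should agree with the figures 8.8933 and 8.74236 quoted for (1.1).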

The justification of formula (1.1) involves Wald’s identity (which gives the first term
of (1.1) as a first approximation to EX) and coupling (which gives the second term of
(1.1) as a correction for “boundary overshoot error” in the first term). A bound on the
difference between EX and (1.1) can be given in terms of, say, the probability that
$\max_{i \le X} K_i$ exceeds a suitable threshold and the
probability that a certain Markov-chain coupling is unsuccessful.

Section 4 gives some explicit bounds on the approximation error. These bounds are
generally very crude, but they show that the approximation error converges to zero faster
than $\exp(-m^{1/3})$ as $m \to \infty$ when the (fixed) K-distribution has a finite moment generating
function near 0. If the $K_i$’s are bounded, or if the hazard function of the $K_i$’s is bounded
below by $\delta > 0$, then the approximation error converges to zero exponentially (in m) as
$m \to \infty$.

Section  5  presents  a  generalization  of  (1.1)  applicable  to  the  case  where  some  of  the 
balls  in  the  box  are  red  to  begin  with,  and  where  only  a  specified  number  of  the  white 
balls  need  to  be  painted  red. 

2.  Application  of  the  Wald  Identity 

Assume the balls are numbered 1, 2, ..., m. Sampling k balls without replacement can
of course be done by sampling with replacement until k distinct balls have been drawn, with
repetitions ignored. This section will relate the sampling-without-replacement scenario
described in the introduction to the process of repeatedly drawing balls one at a time,
with replacement. Analyzing the number $\tau$ of single-ball draws needed to see all the balls
is easy: it’s just the standard coupon collector problem. Except for “boundary overshoot
error,” the Wald identity will give an expression for EX in terms of $E\tau$.

Suppose our successive samples of balls are obtained as follows. First we see the value
of $K_1$. Then we sample balls one at a time, with replacement, until $K_1 \wedge m$ distinct balls
have been obtained. Let $D_1$ be the number of single-ball draws needed to obtain the
required $K_1 \wedge m$ distinct balls. Then we see $K_2$ and sample balls one at a time, with
replacement, until $K_2 \wedge m$ distinct balls have been obtained to form the second sample,
etc. Let $D_i$ be the number of single-ball draws needed to obtain the $K_i \wedge m$ distinct
balls in the ith sample. Let $\mathcal{F}_i$ be the $\sigma$-field generated by all observations made while
generating the first i samples. Thus, $\mathcal{F}_i$ is generated by $K_1, K_2, \ldots, K_i$ and by the sequence
of $D_1 + \cdots + D_i$ single-ball draws needed to obtain the first i samples. The $K_i$’s are of
course assumed to be iid. The ball numbers of the balls obtained on successive draws are
iid, uniformly distributed on {1, 2, ..., m}, and independent of the $K_i$’s. The $D_i$’s are also
iid.
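
The sampling scheme just described can be simulated directly. The sketch below is a Monte Carlo illustration under assumed choices (m = 10, sample sizes with K − 1 binomial (4, 1/2), a fixed seed, and hypothetical function names); it is not part of the derivation.

```python
import random

def draw_sample(k, m, rng):
    """Draw balls with replacement until k distinct balls appear.

    Returns the set of distinct balls and the number D of single-ball draws.
    """
    seen, draws = set(), 0
    while len(seen) < k:
        seen.add(rng.randrange(m))
        draws += 1
    return seen, draws

def simulate_x(sample_size, m, rng):
    """Number of samples X needed until every one of the m balls has been seen."""
    painted, x = set(), 0
    while len(painted) < m:
        k = min(sample_size(rng), m)          # sample size is K truncated at m
        balls, _ = draw_sample(k, m, rng)
        painted |= balls
        x += 1
    return x

rng = random.Random(12345)
binom_size = lambda r: 1 + sum(r.random() < 0.5 for _ in range(4))
trials = 20000
estimate = sum(simulate_x(binom_size, 10, rng) for _ in range(trials)) / trials
print(estimate)
```

The printed average is a noisy estimate of EX and should be close to, but not exactly equal to, the true value for this case.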


Let $\tau$ be the number of single-ball draws needed to see every ball at least once. The
number of additional single-ball draws needed to see a new ball after j balls have already
been seen is a geometric $(\frac{m-j}{m})$ random variable independent of everything that has come
before. (This is the standard coupon collector argument.) Adding up expectations yields

(2.1)    $E\tau = \sum_{j=0}^{m-1} \frac{m}{m-j} = m \sum_{i=1}^{m} \frac{1}{i}.$


Again letting X be the number of samples needed to see all m balls, we see that

    $X = \inf\{i : D_1 + D_2 + \cdots + D_i \ge \tau\}.$

Furthermore, X is an $\mathcal{F}_i$ stopping time. Thus, by Wald’s identity and the fact that the
$D_i$’s are iid,

(2.2)    $E(D_1 + \cdots + D_X) = EX \cdot ED_1.$

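The expectation $ED_1$ is easy to find in terms of the distribution of $K_1$. The coupon
collector argument shows

    $E(D_1 \mid K_1 \wedge m = k) = \sum_{i=0}^{k-1} \frac{m}{m-i},$

so

(2.3)    $E(D_1) = \sum_{k=1}^{m} P\{K_1 \wedge m = k\} \sum_{i=0}^{k-1} \frac{m}{m-i}$
             $= \sum_{i=0}^{m-1} \frac{m}{m-i} \sum_{k=i+1}^{m} P\{K_1 \wedge m = k\}$
             $= \sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}.$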
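
The interchange of summation in (2.3) can be checked with exact rational arithmetic. This is a small sketch with assumed helper names; it compares the conditional form of the expectation with the tail-sum form, including a case where K is truncated at m.

```python
from fractions import Fraction
from math import comb

def ed1_conditional(pmf, m):
    """Sum over k of P{min(K, m) = k} * sum_{i<k} m/(m-i)."""
    total = Fraction(0)
    for k, p in pmf.items():
        total += p * sum(Fraction(m, m - i) for i in range(min(k, m)))
    return total

def ed1_tail(pmf, m):
    """Sum over i of m/(m-i) * P{K > i}, the last line of (2.3)."""
    return sum(Fraction(m, m - i) * sum(p for k, p in pmf.items() if k > i)
               for i in range(m))

# K - 1 binomial(4, 1/2), with exact probabilities
binom_pmf = {k + 1: Fraction(comb(4, k), 16) for k in range(5)}
print(ed1_conditional(binom_pmf, 10) == ed1_tail(binom_pmf, 10))
print(float(ed1_tail(binom_pmf, 10)))
```

Using `Fraction` keeps the comparison exact, so the equality test is not affected by rounding.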
Now we need to relate $E(D_1 + \cdots + D_X)$ to $E\tau$. The sum $\sum_{i=1}^{X} D_i$ and $\tau$ are of course
equal except for the “overshoot” given by the number of draws needed to complete the
last sample after the last new ball has been obtained on the $\tau$th draw. Let V be the size
of this overshoot, so that

(2.4)    $V = \Bigl(\sum_{i=1}^{X} D_i\Bigr) - \tau.$


From (2.1), (2.2), (2.3), and (2.4) we get

(2.5)    $EX = \dfrac{E\tau + EV}{ED_1} = \dfrac{m \sum_{i=1}^{m} \frac{1}{i} + EV}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}}.$

Let J be the number of distinct balls already in the last sample after the $\tau$th draw.
Given that J = j and $K_X \wedge m = k$ (which is only possible for $k \ge j$), the coupon collector
argument shows

(2.6)    $E(V \mid J = j, K_X \wedge m = k) = \sum_{r=j}^{k-1} \frac{m}{m-r}.$

(If k = j, the right side is supposed to be zero.) Furthermore, the conditional distribution
of $K_X$ given that J = j is exactly the same as that of $K_1$ given that $K_1 \ge j$. (This is
perhaps even more obvious when the sampling is done as described in Section 3.) Thus,

(2.7)    $E(V \mid J = j) = \sum_{k=j+1}^{m} P\{K \wedge m = k \mid K \ge j\} \sum_{r=j}^{k-1} \frac{m}{m-r}$
             $= \sum_{r=j}^{m-1} \frac{m}{m-r} \sum_{k=r+1}^{m} P\{K \wedge m = k \mid K \ge j\}$
             $= \sum_{r=j}^{m-1} \frac{m}{m-r} P\{K > r \mid K \ge j\}.$
(Note that (2.7) is zero for j = m.) If we can get a good approximation for the distribution
of J, we can combine this with (2.7) to approximate EV in (2.5). The next section will
show that P{J = j} is typically approximately proportional to

    $\dfrac{m}{m-(j-1)} P\{K > j-1\} = \dfrac{m}{m-j+1} P\{K \ge j\},$

so that

(2.8)    $P\{J = j\} \approx \dfrac{\frac{m}{m-j+1} P\{K \ge j\}}{\sum_{i=1}^{m} \frac{m}{m-(i-1)} P\{K > i-1\}} = \dfrac{\frac{m}{m-j+1} P\{K \ge j\}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}}.$


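Combining (2.8) with (2.7) yields

(2.9)    $EV = \sum_{j=1}^{m} E(V \mid J = j) P\{J = j\}$
             $\approx \dfrac{\sum_{j=1}^{m-1} \sum_{r=j}^{m-1} \frac{m}{m-r} P\{K > r \mid K \ge j\} \frac{m}{m-j+1} P\{K \ge j\}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}}$
             $= \dfrac{\sum_{r=1}^{m-1} \frac{m}{m-r} P\{K > r\} \sum_{j=1}^{r} \frac{m}{m-j+1}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}}.$

Substituting this approximation for EV into (2.5) gives (1.1).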
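
The route from the approximate distribution of J to the approximation of EX can be traced numerically. The sketch below uses assumed helper names: it builds the weights of (2.8), the conditional overshoot (2.7), and then the resulting approximation of EX through (2.5), for m = 10 and K − 1 binomial (4, 1/2).

```python
from math import comb

def geq(pmf, j):
    """P{K >= j} for a distribution given as {k: P{K = k}}."""
    return sum(p for k, p in pmf.items() if k >= j)

def approx_j_distribution(pmf, m):
    """Normalized weights m/(m-j+1) * P{K >= j}; entry j-1 approximates P{J = j}."""
    weights = [m / (m - j + 1) * geq(pmf, j) for j in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def overshoot_given_j(pmf, m, j):
    """E(V | J = j) = sum_{r=j}^{m-1} m/(m-r) * P{K > r | K >= j}."""
    if geq(pmf, j) == 0:
        return 0.0
    return sum(m / (m - r) * geq(pmf, r + 1) for r in range(j, m)) / geq(pmf, j)

def approx_ev(pmf, m):
    pj = approx_j_distribution(pmf, m)
    return sum(pj[j - 1] * overshoot_given_j(pmf, m, j) for j in range(1, m + 1))

binom_pmf = {k + 1: comb(4, k) / 16 for k in range(5)}
m = 10
e_tau = sum(m / i for i in range(1, m + 1))
e_d1 = sum(m / (m - i) * geq(binom_pmf, i + 1) for i in range(m))
approx = (e_tau + approx_ev(binom_pmf, m)) / e_d1
print(approx)
```

The normalizing constant of the weights is exactly $ED_1$, which is why the second term of (1.1) carries the square of that sum in its denominator.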


3. Approximation of P{J = j} by Coupling


This section will justify the approximation (2.8) above:

(3.1)    $P\{J = j\} \approx \dfrac{\frac{m}{m-j+1} P\{K \ge j\}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}}, \quad j = 1, 2, \ldots, m.$


The idea is to construct a finite-state-space Markov chain with m absorbing states labelled
1, 2, ..., m for which J equals the label of the state where absorption occurs. This chain
is coupled to an approximating chain for which the distribution of the absorbing state is
given (modulo truncation) by the right side of (3.1). (For a general account of the coupling
technique, see Lindvall (1992).)

For the purposes of this section, it will be better to use a recipe for generating the
required samples which is different from that of the previous section. To start, draw a
single ball from the box. Then ask whether the next ball should continue the first sample.
The probability of continuing the first sample should be

    $c_1 := \dfrac{P\{K \wedge m > 1\}}{P\{K \wedge m \ge 1\}}.$

If we decide to continue the first sample, draw another ball from the box without replacing
the first ball. With probability

    $h_1 := 1 - c_1 = \dfrac{P\{K \wedge m = 1\}}{P\{K \wedge m \ge 1\}},$

we decide to end the first sample with the first ball. In this case, we return the first ball
to the box and then draw the first ball of the second sample.

The general rule for sampling goes as follows. If the number of balls already in the
current sample is a, we decide to continue this sample with the next ball with probability

    $c_a := \dfrac{P\{K \wedge m > a\}}{P\{K \wedge m \ge a\}}.$

With probability

    $h_a := 1 - c_a = \dfrac{P\{K \wedge m = a\}}{P\{K \wedge m \ge a\}}$

we end the current sample, return all balls in this current sample to the box, and then
draw the first ball of the next sample. (Note that $h_a$ is the hazard function of the K
distribution.) Balls are returned to the box only when we decide that the current sample
is complete and that it’s time to start a new one. It should be obvious that this protocol
really does generate independent samples whose common sample size distribution is as
required.
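
That the continue/stop rule reproduces the required sample size distribution can be verified exactly: the probability of stopping at size a is $h_a$ times the product of the earlier continuation probabilities. The sketch below uses assumed function names and exact rational arithmetic.

```python
from fractions import Fraction

def hazards(pmf, m):
    """h_a = P{min(K, m) = a} / P{min(K, m) >= a} for every reachable size a."""
    trunc = {}
    for k, p in pmf.items():
        trunc[min(k, m)] = trunc.get(min(k, m), Fraction(0)) + p
    h = {}
    for a in range(1, m + 1):
        at_least = sum(p for k, p in trunc.items() if k >= a)
        if at_least == 0:
            break                      # sizes >= a are never reached
        h[a] = trunc.get(a, Fraction(0)) / at_least
    return h

def size_distribution(h):
    """Distribution of the sample size produced by the continue/stop protocol."""
    dist, survive = {}, Fraction(1)
    for a in sorted(h):
        dist[a] = survive * h[a]       # reach size a, then stop
        survive *= 1 - h[a]            # reach size a, then continue
    return dist

unif_pmf = {k: Fraction(1, 5) for k in range(1, 6)}
h = hazards(unif_pmf, 10)
print(h, size_distribution(h) == unif_pmf)
```

For the uniform distribution on {1, ..., 5} the hazards are 1/5, 1/4, 1/3, 1/2, 1, and the protocol returns the uniform distribution again.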

Let $\mathcal{G}_n$ be the $\sigma$-field generated by everything that happens up to and including the
nth ball-draw (but not including the decision as to whether the nth ball is the last ball of
the sample containing it). Let $\tilde{A}_n$ be the number of balls already in the current sample
after the nth ball-draw. Let $\tilde{V}_n$ be the number of “virgin” balls left to be drawn for the
first time after the nth draw. Define the $\mathcal{G}_n$ stopping time $T_0$ by

(3.6)    $T_0 = \inf\{n : \tilde{V}_n = 0\},$

so that $T_0$ is the number of ball draws needed to get all balls at least once (according to
the sampling protocol of this section). Note that $\tilde{A}_{T_0} = J$. Define

(3.7)    $A_n := \tilde{A}_{n \wedge T_0}$ and $V_n := \tilde{V}_{n \wedge T_0}.$

Then $\{(A_n, V_n)\}_{n=1}^{\infty}$ is a Markov chain adapted to $\mathcal{G}_n$. Since we start with n = 1, the
starting state is (1, m − 1). The state space is

    $S_m := \{(a, v) : a \in \{1, 2, \ldots, m\},\ v \in \{0, 1, \ldots, m-1\},\ a + v \le m\}.$

For $v \ge 1$, the transition probabilities are

(3.8)    $P\{(A_{n+1}, V_{n+1}) = (a', v') \mid (A_n, V_n) = (a, v)\}$
             $= h_a \frac{m-v}{m}$         for $(a', v') = (1, v)$
             $= h_a \frac{v}{m}$           for $(a', v') = (1, v-1)$
             $= c_a \frac{m-a-v}{m-a}$     for $(a', v') = (a+1, v)$
             $= c_a \frac{v}{m-a}$         for $(a', v') = (a+1, v-1)$
             $= 0$                         otherwise.

Since the $(A_n, V_n)$ chain stops at time $T_0$, states of the form (a, 0) are of course absorbing,
so that

(3.9)    $P\{(A_{n+1}, V_{n+1}) = (a', v') \mid (A_n, V_n) = (a, 0)\}$
             $= 1$   if $(a', v') = (a, 0)$
             $= 0$   otherwise.

Since $A_{T_0} = J$, we can analyze the behavior of J by getting the approximate probabilities
of absorption at each of the m absorbing states (a, 0).


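Fix $b \in \{1, 2, \ldots, m-1\}$. For $a = 1, 2, \ldots$, define

(3.10)    $c_a^{(b)} = c_a$ if $a < b$, and $c_a^{(b)} = 0$ if $a \ge b$,

and

(3.11)    $h_a^{(b)} = h_a$ if $a < b$, and $h_a^{(b)} = 1$ if $a \ge b$.

Let $(A_n^{(b)}, V_n^{(b)})$ be a Markov chain on $S_m$ whose transition probabilities are given by (3.8)
and (3.9), except with $c_a^{(b)}$ and $h_a^{(b)}$ in place of $c_a$ and $h_a$. Let $\mathcal{G}_n^{(b)}$ be the filtration for this
chain.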


As far as the transition probabilities are concerned, the $(A_n^{(b)}, V_n^{(b)})$ chain is like the
$(A_n, V_n)$ chain with $K \wedge b$ in place of K as the generic sample size. If P{K > b} = 0, the
transition probabilities for the two chains are of course exactly the same. Our coupling
construction below will work well when we can choose a b not too close to m for which
P{K > b} is so small that $P\{\max_{i \le X} K_i > b\}$ is itself negligible.

The $\{(A_n^{(b)}, V_n^{(b)})\}$ Markov chain will be started with $V_1^{(b)} = m - b$. The value of
$A_1^{(b)}$ will be random, with

(3.12)    $P\{A_1^{(b)} = a\} = \dfrac{\frac{m}{m-a+1} P\{K \ge a\}}{\sum_{i=0}^{b-1} \frac{m}{m-i} P\{K > i\}}, \quad 1 \le a \le b.$

(Compare (3.12) to (3.1). The distribution for $A_1^{(b)}$ in (3.12) is the distribution on the
right side of (3.1), conditioned to be $\le b$.)


For $v \in \{0, 1, \ldots, m-b\}$, define

(3.13)    $T_v^{(b)} = \inf\{n : V_n^{(b)} = v\}.$

Since $V_{n+1}^{(b)} - V_n^{(b)}$ is always either 0 or −1,

(3.14)    $1 = T_{m-b}^{(b)} < T_{m-b-1}^{(b)} < \cdots < T_0^{(b)}.$


Here is the key result for our justification of (3.1):

Proposition 3.15.

If $V_1^{(b)} = m - b$ and $A_1^{(b)}$ has the distribution given by (3.12), then $A^{(b)}_{T_v^{(b)}}$ has distribution
(3.12) for all $v \in \{0, 1, \ldots, m-b\}$. In particular, the first coordinate $A^{(b)}_{T_0^{(b)}}$ of the state
$(A^{(b)}_{T_0^{(b)}}, 0)$ where absorption occurs has distribution (3.12).
Proof.


Define $\{B_n^{(b)}\}_{n=1}^{\infty}$ to be a Markov chain on the state space {1, 2, ..., b} which acts like
a non-stopped version of $A_n^{(b)}$. Thus, the transition probabilities for the $B_n^{(b)}$ chain are

(3.16)    $P\{B_{n+1}^{(b)} = a' \mid B_n^{(b)} = a\}$
              $= c_a^{(b)} = \dfrac{P\{K \wedge b > a\}}{P\{K \wedge b \ge a\}}$   if $a' = a + 1$
              $= h_a^{(b)} = \dfrac{P\{K \wedge b = a\}}{P\{K \wedge b \ge a\}}$   if $a' = 1$
              $= 0$   otherwise.

It is well-known (and trivial to check) that the stationary distribution of this chain is

(3.17)    $\pi_a = \dfrac{P\{K \ge a\}}{\sum_{i=1}^{b} P\{K \ge i\}} = \dfrac{P\{K \ge a\}}{E(K \wedge b)}, \quad 1 \le a \le b.$
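
The stationarity claim for this renewal-type chain can be checked exactly for a concrete distribution. The sketch below uses assumed names and assumes P{K ≥ b} > 0 so that every state 1, ..., b is reachable; it builds the transition matrix for sample size K ∧ b and tests whether the vector proportional to P{K ≥ a} is invariant.

```python
from fractions import Fraction

def renewal_chain(pmf, b):
    """Transition matrix (1-indexed, row 0 unused) of the chain on {1, ..., b}."""
    at_least = lambda a: sum(p for k, p in pmf.items() if min(k, b) >= a)
    P = [[Fraction(0)] * (b + 1) for _ in range(b + 1)]
    for a in range(1, b + 1):
        h = sum(p for k, p in pmf.items() if min(k, b) == a) / at_least(a)
        P[a][1] += h                    # sample ends: next sample starts at size 1
        if a < b:
            P[a][a + 1] += 1 - h        # sample continues to size a + 1
    return P

def stationary_candidate(pmf, b):
    """pi_a proportional to P{K >= a}, a = 1, ..., b."""
    w = [sum(p for k, p in pmf.items() if k >= a) for a in range(1, b + 1)]
    return [Fraction(0)] + [x / sum(w) for x in w]

def is_stationary(pi, P, b):
    return all(sum(pi[a] * P[a][a2] for a in range(1, b + 1)) == pi[a2]
               for a2 in range(1, b + 1))

unif_pmf = {k: Fraction(1, 5) for k in range(1, 6)}
b = 4
pi = stationary_candidate(unif_pmf, b)
print(pi[1:], is_stationary(pi, renewal_chain(unif_pmf, b), b))
```

For K uniform on {1, ..., 5} and b = 4 the candidate is (5, 4, 3, 2)/14, and the check confirms that it is invariant.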


Define another Markov chain $(B_n^{(b)}, W_n^{(b)})$ by having $B_n^{(b)}$ as above and the $W_n^{(b)}$’s a
sequence of Bernoulli random variables. Set $W_1^{(b)} = 1$. Conditional on the entire path

    $B^{(b)} := (B_1^{(b)}, B_2^{(b)}, \ldots)$

of the $B_n^{(b)}$ chain, the $W_n^{(b)}$’s, $n \ge 2$, are to be independent, with

(3.18)    $P\{W_n^{(b)} = 1 \mid B^{(b)}\} = 1 - P\{W_n^{(b)} = 0 \mid B^{(b)}\} = \dfrac{d}{m - B_n^{(b)} + 1}$

for some constant d, $0 < d \le m - b$.


Now consider the “embedded chain” whose consecutive states are the consecutive
states of the $(B_n^{(b)}, W_n^{(b)})$ chain at which $W_n^{(b)} = 1$. (The times at which $W_n^{(b)} = 0$ are
skipped.) The stationary distribution for $B^{(b)}$ in this embedded chain is

(3.19)    $\pi_a^{emb} = \dfrac{\frac{d}{m-a+1} \pi_a}{\sum_{i=1}^{b} \frac{d}{m-i+1} \pi_i} = \dfrac{\frac{m}{m-a+1} P\{K \ge a\}}{\sum_{i=1}^{b} \frac{m}{m-i+1} P\{K \ge i\}}, \quad 1 \le a \le b.$

(Note that (3.12) and (3.19) are the same distribution.) The reasoning for (3.19) is as
follows. The stationary distribution of an ergodic Markov chain gives the long-run fraction
of the time that the chain spends in each state. By (3.17), (3.18), and the strong law of
large numbers, the long-run fraction of the time that the $(B_n^{(b)}, W_n^{(b)})$ chain spends in a
state (a, 1) equals

    $\pi_a \dfrac{d}{m - a + 1}.$

Thus, the long-run fraction of the time that the embedded chain spends in a state (a, 1) is
given by (3.19).


Recall that if the starting state of an ergodic Markov chain is chosen according to
the stationary distribution, then the state one time unit later will also have the stationary
distribution.

Now compare the $(A_n^{(b)}, V_n^{(b)})$ chain with the $(B_n^{(b)}, W_n^{(b)})$ chain. The differences $V_{n-1}^{(b)} - V_n^{(b)}$
are Bernoulli random variables. For $n \le T_{m-b-1}^{(b)}$,

    $P\{V_{n-1}^{(b)} - V_n^{(b)} = 1 \mid A_n^{(b)}\} = \dfrac{m - b}{m - A_n^{(b)} + 1},$

which looks like (3.18) with d = m − b. Thus, up until time

    $T_{m-b-1}^{(b)} = \inf\{n : V_{n-1}^{(b)} - V_n^{(b)} = 1\},$

$(A_n^{(b)}, V_{n-1}^{(b)} - V_n^{(b)})$ is a chain which acts just like $(B_n^{(b)}, W_n^{(b)})$. Since $A^{(b)}_{T_{m-b-1}^{(b)}}$ corresponds
to the second value of $B^{(b)}$ in the embedded chain, $A^{(b)}_{T_{m-b-1}^{(b)}}$ will have the stationary
distribution (3.12) = (3.19) if $A_1^{(b)}$ starts out in this stationary distribution.

It follows in the same way that $A^{(b)}_{T_{v-1}^{(b)}}$ has distribution (3.12) if $A^{(b)}_{T_v^{(b)}}$ has distribution
(3.12). Thus, Proposition 3.15 follows by induction.

□


Now the idea is to show that, with high probability, $(A_n, V_n)$ and $(A_n^{(b)}, V_n^{(b)})$ can be
coupled so as to end up at the same absorbing state. We start by choosing a starting state
for the $(A_n^{(b)}, V_n^{(b)})$ chain as described above, with $V_1^{(b)} = m - b$. Then we let the $(A_n, V_n)$
chain (which always starts in state (1, m − 1)) run until $V_n = m - b$, without the $(A_n^{(b)}, V_n^{(b)})$
chain moving. Once the chains are on the same V-level, we can sequentially choose one
or the other chain to take a step, with the goal of course being to get them to inhabit the
same state at the same time. Or, we can let both chains move simultaneously, with the
transitions for the two chains being dependent if we want. The requirement is only that,
each time a chain takes a step, it must do so according to its own transition probabilities;
this will guarantee that the path of each individual chain has the right probabilities. (In
particular, the distribution of the absorbing state will be correct.) If we can get the
two chains to meet, we couple them so that they move together. Since the transition
probabilities are the same as long as $A_n$ and $A_n^{(b)}$ are < b, the chains stay coupled as long
as the common A-coordinate is < b. A coupling can fall apart, however. If the common
A-value of the two coupled chains reaches b, then with probability $c_b$ the chains will be
decoupled after the next step. If one chain has its V coordinate decrease before coupling
is achieved on a common V-level, we just leave this chain alone for a while and run the
other chain until its V coordinate also drops, after which we try to achieve a coupling at
this next lower V-level.

It  does  not  seem  easy  to  describe  (or  determine)  optimal  coupling  schemes.  However, 
even  very  simple-minded  strategies  can  sometimes  have  very  high  probabilities  of  successful 
coupling. 

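Let $A_\infty = A_{T_0}$ and $A_\infty^{(b)} = A^{(b)}_{T_0^{(b)}}$ denote the A-values of the absorbing states for the
two Markov chains. Let

    $\|P\{A_\infty \in \cdot\} - P\{A_\infty^{(b)} \in \cdot\}\| = \sum_{j} |P\{A_\infty = j\} - P\{A_\infty^{(b)} = j\}|$

denote the total variation distance between the distributions of $A_\infty$ and $A_\infty^{(b)}$.

Proposition 3.20.

If P{K > b} = 0, then

    $\|P\{A_\infty \in \cdot\} - P\{A_\infty^{(b)} \in \cdot\}\| \le 2 \exp\bigl(-\tfrac{m-2b}{b}\bigr) \le 15 e^{-m/b},$

where the distribution of $A_\infty^{(b)}$ is given by the right side of (3.12).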

Proof.

The total variation distance is obviously bounded by twice the probability that the
$A_n$ and $A_n^{(b)}$ chains are not coupled when they are absorbed. (Cf. Lindvall (1992), page 12.)
Here is a coupling strategy which has probability less than $\exp(-\frac{m-2b}{b})$ that the chains
are not coupled at absorption. For the (at most) b possible states in $S_m$ with V coordinate
v, order the states as follows:

    $(2, v) \prec (3, v) \prec \cdots \prec (b, v) \prec (1, v).$

When both chains have their V coordinates equal to v, run the chain which is “behind”
according to this ordering and leave the other chain alone. If the chains meet, couple
them. (Since $A_n > b$ is impossible, the coupling will not be broken later.) If one chain’s
V coordinate decreases (by 1), run the other chain until its V coordinate decreases by 1
also.


If no decrease in a V coordinate occurs for the first b − 1 steps taken when the Markov
chains are both on V-level v, then the chains are guaranteed to couple on this V-level.
Each time a chain takes a step starting on V-level v, the probability that its V coordinate
will not decrease is at least $\frac{m-b-v}{m-b}$. (See (3.8).) Thus, the probability of no coupling on
V-level v is at most $1 - (\frac{m-b-v}{m-b})^{b-1}$.
The probability that coupling never occurs on any of the V-levels m − b, m − b − 1, ..., 2, 1
is at most

(3.21)    $\prod_{v=1}^{m-b} \Bigl[1 - \bigl(\tfrac{m-b-v}{m-b}\bigr)^{b-1}\Bigr] = \prod_{i=0}^{m-b-1} \Bigl[1 - \bigl(\tfrac{i}{m-b}\bigr)^{b-1}\Bigr]$
              $\le \exp\Bigl\{-\sum_{i=0}^{m-b-1} \bigl(\tfrac{i}{m-b}\bigr)^{b-1}\Bigr\}$   since $1 - x \le e^{-x}$.

But

(3.22)    $\sum_{i=0}^{m-b-1} \bigl(\tfrac{i}{m-b}\bigr)^{b-1} \ge (m-b)^{-(b-1)} \int_0^{m-b} x^{b-1}\,dx - 1 = \dfrac{m-b}{b} - 1 = \dfrac{m-2b}{b}.$

Together, (3.21) and (3.22) imply that $\exp(-\frac{m-2b}{b})$ bounds the probability of the
chains not being coupled at absorption.

□
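
The chain of inequalities behind Proposition 3.20 can be checked numerically for specific m and b. The sketch below assumes that the per-level failure probability is read as $1 - (i/(m-b))^{b-1}$, that the failure bound is $\exp(-(m-2b)/b)$, and that the constant 15 arises because $2e^2 < 15$; the function names are hypothetical.

```python
from math import exp, prod

def no_coupling_product(m, b):
    """Product over i = 0..m-b-1 of 1 - (i/(m-b))^(b-1), read here as (3.21)."""
    n = m - b
    return prod(1 - (i / n) ** (b - 1) for i in range(n))

def failure_bound(m, b):
    """exp(-(m - 2b)/b), the bound on the probability of no coupling."""
    return exp(-(m - 2 * b) / b)

m, b = 100, 5
print(no_coupling_product(m, b), failure_bound(m, b), 15 * exp(-m / b))
```

The three printed numbers should be in increasing order once the middle one is doubled: the product is at most the exponential bound, and twice that bound is at most $15 e^{-m/b}$.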

We will see that Proposition 3.20 still holds even when $P\{K_i > b\} > 0$, provided we
add $2P\{\max_{i \le X} K_i > b\}$ onto the right side of the inequality in Proposition 3.20. To bound
$P\{\max_{i \le X} K_i > b\}$, it will help to have a bound on X.

If we let $\tau$ be the number of single-ball draws (with balls replaced after each draw)
needed to see every ball at least once (as in Section 2), then it is obvious that $P\{X \ge t\} \le
P\{\tau \ge t\}$ for all t. (Recall we assume $P\{K \ge 1\} = 1$.)
But


Thus, 


However, 


m 

y—= 


1  1  1 
1+3  +  5  +  -"+toT3T 


4  1  f2ml 

3  +  2  J4  x* 


<  1  +  -  log  m. 


E(  1  +  — -)r  <  (me)i. 
2m ' 


(1  + -^-)m2  >  (e2'">  I^)mJ=e?  * 
2m 


(since  log  (1  +  x)  >  x  — —  for  0  <  x  <  1). 


Thus,  by  the  Markov  inequality, 


P{t  >  m2}  <  (me)*e ®  i’  <2mJe  ”  . 


Corollary 3.24. $P\{X \ge m^2\} \le 2 m^{1/2} e^{-m/2}.$


Proposition 3.25.

If $P\{K > b\} \le \varepsilon$, then

    $\|P\{A_\infty \in \cdot\} - P\{A_\infty^{(b)} \in \cdot\}\| \le 15 e^{-m/b} + 2 m^2 \varepsilon + 4 m^{1/2} e^{-m/2}.$

Proof.

Use the same coupling strategy as in the proof of Proposition 3.20. Expression (3.21)
(and therefore $\frac{15}{2} e^{-m/b}$) bounds the probability of the event “no (initial) coupling occurs
and $\max_{i \le X} K_i \le b$.” But by Corollary 3.24,

(3.26)    $P\{\max_{i \le X} K_i > b\} \le m^2 P\{K > b\} + P\{X \ge m^2\} \le m^2 \varepsilon + 2 m^{1/2} e^{-m/2}.$

Thus the probability of “no (initial) coupling occurs or $\max_{i \le X} K_i > b$” is bounded by

(3.27)    $\frac{15}{2} e^{-m/b} + m^2 \varepsilon + 2 m^{1/2} e^{-m/2}.$

Note that a coupling, once made, is never broken on $\{\max_{i \le X} K_i \le b\}$. Thus, (3.27) bounds
the probability that the $(A_n, V_n)$ and $(A_n^{(b)}, V_n^{(b)})$ chains are not coupled at absorption.
Multiplying (3.27) by 2 gives the desired bound on the total variation distance.

□

If the hazard function $h_a$ of the $K_i$’s is bounded below, a different coupling rule
sometimes works better than the one used above. The bound in the next proposition
should be particularly good if the hazard function $h_a$ is (approximately) constant, so that
the $K_i$’s are (approximately) geometric random variables.


Proposition 3.28.

If the hazard function $h_a$ of the $K_i$’s is bounded below by $\delta > 0$ for $a \le b$, and if
$(m - b)\delta \ge 1$, then

    $\|P\{A_\infty \in \cdot\} - P\{A_\infty^{(b)} \in \cdot\}\| \le 6 \sqrt{m-b} \left[ \dfrac{\delta^{\delta}}{(1+\delta)^{1+\delta}} \right]^{m-b} + 4 m^{1/2} e^{-m/2} + 2 m^2 (1 - \delta)^b.$


Remark. If the $K_i$’s are geometric (1/2) and m = 100, then taking b = 62 and $\delta = 1/2$ causes
the bound in Proposition 3.28 to equal $1.08 \times 10^{-14}$.
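
The figure in the Remark can be reproduced numerically. The sketch below reads the bound of Proposition 3.28 as $6\sqrt{m-b}\,[\delta^\delta/(1+\delta)^{1+\delta}]^{m-b} + 4 m^{1/2} e^{-m/2} + 2m^2(1-\delta)^b$; the power $m^{1/2}$ in the middle term is an assumption about a garbled exponent (the term is negligible for these parameters), and the function name is hypothetical.

```python
from math import exp, sqrt

def prop_3_28_bound(m, b, delta):
    """Total variation bound, under the reading stated in the lead-in."""
    base = delta ** delta / (1 + delta) ** (1 + delta)
    coupling = 6 * sqrt(m - b) * base ** (m - b)      # coupling never achieved
    long_run = 4 * sqrt(m) * exp(-m / 2)              # assumed m^(1/2) e^(-m/2) term
    big_sample = 2 * m ** 2 * (1 - delta) ** b        # some sample size exceeds b
    return coupling + long_run + big_sample

print(prop_3_28_bound(100, 62, 0.5))
```

With m = 100, b = 62 and $\delta = 1/2$ the first and third terms are of comparable size, which is why b = 62 is a reasonable choice for balancing them.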


Proof.

Again, if the Markov chains are on different V-levels, run only the one on the higher
V-level until it drops down to the V-level where the other is. When both chains are on
the same V-level, the strategy here will be to have the chains move simultaneously and
dependently. The goal will be to get both chains to jump to state (1, v) at the same time,
or to jump to state (1, v − 1) at the same time.

Lemma 3.29. Suppose the $(A_n, V_n)$ and $(A_n^{(b)}, V_n^{(b)})$ chains are both on V-level v in
different states. Then by having the chains take dependent, simultaneous steps, it is
possible to make the (conditional, given the past) probability of the event “$A_n$ does not
exceed b and the chains don’t couple on or before the first time that at least one chain
drops to V-level v − 1” less than or equal to

    $\dfrac{v}{v + (m-b)\delta}.$

Proof of Lemma 3.29.

We need to describe the dependence between the next step of the $(A_n, V_n)$ chain
and the next step of the $(A_n^{(b)}, V_n^{(b)})$ chain. We first determine the next A values. By
assumption, when $A_n < b$ each chain has probability $\ge \delta$ of having its A coordinate jump
to 1 on the next step. If $A_n = b$, then $A_{n+1}$ will either equal 1 or b + 1, and the $(A_n^{(b)}, V_n^{(b)})$
chain of course still has probability $\ge \delta$ of the next A value being 1.

Thus, when $A_n \le b$ we can choose the next A values in such a way that the event
“either $A_{n+1} > b$, or both chains have their next A values equal to 1” has probability $\ge \delta$.
After determining the new A values we determine the new V values. If both new A values
are 1, then for both chains the conditional probability is $\frac{m-v}{m}$ that the new V value is v
and $\frac{v}{m}$ that the new V value is v − 1. Thus, when both new A values are 1, we can choose
the new V values to be equal as well, so the chains couple. Providing that the new A
value for the $(A_n, V_n)$ chain is $\le b$, the conditional probability (given its new A value) for
each individual chain to drop to V-level v − 1 is $\le \frac{v}{m-b}$. Thus, we can make the new V
values dependent in such a way that the probability of a V-level decrease in either chain
is $\le \frac{v}{m-b}$ (unless $A_{n+1} > b$, in which case we don’t care about V values).

If the consecutive steps of the two chains are made dependent as described in the
previous paragraph, then the event “the $A_n$’s do not exceed b and the chains do not
couple on or before the first time a chain leaves V-level v” has probability less than or
equal to

    $\dfrac{\frac{v}{m-b}}{\delta + \frac{v}{m-b}} = \dfrac{v}{v + (m-b)\delta}.$

□


Continuation of Proof of Proposition 3.28.

Lemma 3.29 immediately implies that the event “the chains never couple (on any
V-level) and $A_n$ never exceeds b (on any V-level)” has probability less than or equal to

(3.30)    $\prod_{v=1}^{m-b} \dfrac{v}{v + (m-b)\delta} = \dfrac{(m-b)!\,\Gamma\{(m-b)\delta + 1\}}{\Gamma\{(m-b)(1+\delta) + 1\}},$

where $\Gamma(\cdot)$ is the gamma function. To estimate (3.30), we can use

    $\Gamma(x + 1) = \sqrt{2\pi}\, x^{x + 1/2} e^{-x} e^{\theta(x)/(12x)}, \quad x > 0,\ 0 < \theta(x) < 1.$

(See the Encyclopedia of Statistical Sciences (1988), Wiley, Kotz and Johnson, Eds., vol. 8,
p. 779.) Applying this form of Stirling’s formula to (3.30) yields

    $\prod_{v=1}^{m-b} \dfrac{v}{v + (m-b)\delta} \le \sqrt{2\pi(m-b)}\; e^{1/6} \left[ \dfrac{\delta^{\delta}}{(1+\delta)^{1+\delta}} \right]^{m-b} \le 3 \sqrt{m-b} \left[ \dfrac{\delta^{\delta}}{(1+\delta)^{1+\delta}} \right]^{m-b}$   if $(m-b)\delta \ge 1$.

Since an achieved coupling is never broken on $\{\max_{i \le X} K_i \le b\}$, the probability of the chains
not being coupled at absorption is bounded by

    $3 \sqrt{m-b} \left[ \dfrac{\delta^{\delta}}{(1+\delta)^{1+\delta}} \right]^{m-b} + P\{\max_{i \le X} K_i > b\}.$

Multiplying by 2 and applying Corollary 3.24 (as in (3.26)) produces the bound in
Proposition 3.28.


4. Bounds on the Approximation Error

Using the bounds on the total variation distance $\|P\{A_\infty \in \cdot\} - P\{A_\infty^{(b)} \in \cdot\}\|$ given
in Section 3, it is easy to derive bounds on the difference between (1.1) and EX. Recall
again that J equals the random variable $A_\infty$ from Section 3.


Lemma 4.1. For any $b \in \{1, 2, \ldots, m-1\}$,

    $\Bigl| EV - \sum_{j=1}^{m} E(V \mid J = j) P\{A_\infty^{(b)} = j\} \Bigr| \le \tfrac{1}{2} \|P\{A_\infty \in \cdot\} - P\{A_\infty^{(b)} \in \cdot\}\| \max_{j} E(V \mid J = j),$

where $E(V \mid J = j)$ is set equal to 0 when $P\{K \ge j\} = 0$.


Proof.


The left side equals

    $\Bigl| \sum_{j} E(V \mid J = j) \bigl( P\{A_\infty = j\} - P\{A_\infty^{(b)} = j\} \bigr) \Bigr|,$

which is less than or equal to the right side.


Lemma 4.2. For any $b \in \{1, \ldots, m-1\}$,

    $\left| \dfrac{\sum_{r=1}^{m-1} \frac{m}{m-r} P\{K > r\} \sum_{j=1}^{r} \frac{m}{m-j+1}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}} - \dfrac{\sum_{r=1}^{b-1} \frac{m}{m-r} P\{K > r\} \sum_{j=1}^{r} \frac{m}{m-j+1}}{\sum_{i=0}^{b-1} \frac{m}{m-i} P\{K > i\}} \right|$
        $\le \dfrac{\sum_{r=b}^{m-1} \frac{m}{m-r} P\{K > r\} \sum_{j=1}^{r} \frac{m}{m-j+1}}{\sum_{i=0}^{b-1} \frac{m}{m-i} P\{K > i\}}.$


Remark. The first term in Lemma 4.2 is the approximation (2.9) for EV that we used to
get (1.1). The second term in Lemma 4.2 equals $\sum_j E(V \mid J = j) P\{A_\infty^{(b)} = j\}$.


Proof.

$\sum_{j=1}^{r} \frac{m}{m-j+1}$ is obviously increasing in r, so the first term between absolute value signs
is larger than the second term. The first numerator is greater than the second numerator,
and the first denominator is greater than the second denominator. Thus, the difference is
less than the difference between numerators divided by the second (smaller) denominator.


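Proposition 4.3.


The difference between EX and (1.1) is bounded in absolute value by

    $\dfrac{\frac{1}{2} \|P\{A_\infty \in \cdot\} - P\{A_\infty^{(b)} \in \cdot\}\| \max_{j} \sum_{r=j}^{m-1} \frac{m}{m-r} P\{K > r \mid K \ge j\}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}}$
        $+ \dfrac{\sum_{r=b}^{m-1} \frac{m}{m-r} P\{K > r\} \sum_{j=1}^{r} \frac{m}{m-j+1}}{\left[ \sum_{i=0}^{b-1} \frac{m}{m-i} P\{K > i\} \right] \left[ \sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\} \right]}.$


Proof.

This bound follows from (2.5) and Lemmas 4.1 and 4.2.

□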

Now let’s specialize to get bounds which are not quite such a horrendous mess as
the bound in Proposition 4.3.


Proposition 4.4.

If P{K > b} = 0, the difference between EX and (1.1) is bounded in absolute value
by

    $\dfrac{\frac{15}{2} e^{-m/b} \frac{m(b-1)}{m-b}}{\sum_{i=0}^{b-1} \frac{m}{m-i} P\{K > i\}} \le \dfrac{15 m (b-1) e^{-m/b}}{2 (m-b) EK}.$


Proof.


If P{K > b} = 0, then

    $\max_{j \ge 1} \sum_{r=j}^{m-1} \dfrac{m}{m-r} P\{K > r \mid K \ge j\} \le \dfrac{m(b-1)}{m-b}.$

Apply this and Proposition 3.20 to the bound in Proposition 4.3.


□
Remark. If m = 100 and each $K_i - 1$ is binomial (4, 1/2), the second error bound in
Proposition 4.4 equals $1.96 \times 10^{-4}$ (with b = 5 and EK = 3).

Proposition 4.5.

If $P\{K > k\} \le C e^{-\alpha k}$ for $k \le b$, C > 0, and $\alpha > 0$, then the difference between (1.1)
and EX is bounded in absolute value by

+  2mie-^2)~0 8  m  +  Ce-°‘(-l0g  m)». 

2  E(K  Am)  E(K  Ab) 

Remark. If $P\{K > k\} \le C e^{-\alpha k}$ for all k, then letting $b \sim \sqrt{m}$ in the Proposition 4.5
bound shows that the approximation error converges to zero faster than $\exp(-m^{1/3})$ as
$m \to \infty$.
Proposition 4.6.


If the hazard function of the distribution of the $K_i$’s

    $h_a = \dfrac{P\{K = a\}}{P\{K \ge a\}}$   (= 1 for $\frac{0}{0}$)

satisfies $h_a \ge \delta > 0$ for $a \le m$, then the difference between (1.1) and EX is bounded in
absolute value by


^  -  •>*  { (T^r + 2m^m/2 + m2(i  -  *]  (^) 


Proof.

$h_a \ge \delta$ for all $a \le m$ implies $P\{K > r \mid K \ge j\} \le (1 - \delta)^{r-j+1}$. Thus,

    $\max_{j \ge 1} \sum_{r=j}^{m-1} \dfrac{m}{m-r} P\{K > r \mid K \ge j\} \le \max_{j \ge 1} m \sum_{r=j}^{m-1} (1 - \delta)^{r-j+1} \le \dfrac{m(1-\delta)}{\delta}.$

Combining this with Proposition 3.28 gives us the first term above as a bound on the first
term in Proposition 4.3. The second term in Proposition 4.6 follows in the same way as
the second term in Proposition 4.5.

□


Remark. If m = 100 and the $K_i$’s are geometric (1/2), then taking b = 62 and $\delta = 1/2$ causes
the bound in Proposition 4.6 to be $5.64 \times 10^{-13}$.


5.  Generalization 


Suppose we start with w white balls and m − w red balls in the box. Let $Y_{w,n}$ be the
number of iid samples needed to see (or paint red) n of the white balls. (Thus, $X = Y_{m,m}$.)
The formula

(5.1)    $\dfrac{\sum_{i=w-n+1}^{w} \frac{m}{i}}{\sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\}} + \dfrac{\sum_{r=1}^{m-1} \frac{m}{m-r} P\{K > r\} \sum_{j=1}^{r} \frac{m}{m-j+1}}{\left[ \sum_{i=0}^{m-1} \frac{m}{m-i} P\{K > i\} \right]^{2}}$

should (usually) be a good approximation for $EY_{w,n}$ when the first term of (5.1) is large.
The argument for the first term is exactly the same as in Section 2: the numerator is the
expected number of single-ball draws needed to get the required number of white balls, and
the denominator is the expected number of draws needed to complete a sample. The second
term of (5.1) (which is exactly the same as the second term of (1.1)) is again a correction for
the error in the first term caused by “boundary overshoot.” In this case, the $\{(A_n, V_n)\}_{n=1}^{\infty}$
chain of Section 3 starts either in state (1, w) or in state (1, w − 1), depending on whether
the first ball chosen is red or white. The $(A_n, V_n)$ chain is absorbed by the states on V-level
w − n. If a successful coupling can be achieved with high probability between this $(A_n, V_n)$
chain and an $(A_n^{(b)}, V_n^{(b)})$ chain, then the A coordinate of the absorbing state will have a
distribution approximated by (3.12). Providing also that the bound in Lemma 4.2 is small,
the second term of (5.1) will do a good job of correcting the error in the first term.
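
The generalized formula can be evaluated in the same way as (1.1). The sketch below reads the first numerator of (5.1) as the sum of m/i over i = w − n + 1, ..., w (the expected number of single-ball draws needed to see n of the w white balls) and keeps the second term of (1.1) unchanged; the function name and the example parameters are assumptions made for illustration.

```python
from math import comb

def approx_ey(pmf, m, w, n):
    """Two-term approximation for E Y_{w,n}: m balls, w initially white, n to be seen."""
    tail = lambda i: sum(p for k, p in pmf.items() if k > i)     # P{K > i}
    e_d1 = sum(m / (m - i) * tail(i) for i in range(m))
    draws_needed = sum(m / i for i in range(w - n + 1, w + 1))   # first numerator
    numer, inner = 0.0, 0.0
    for r in range(1, m):
        inner += m / (m - r + 1)
        numer += m / (m - r) * tail(r) * inner
    return draws_needed / e_d1 + numer / e_d1 ** 2

binom_pmf = {k + 1: comb(4, k) / 16 for k in range(5)}
print(approx_ey(binom_pmf, 10, 10, 10))   # reduces to the case X = Y_{m,m}
print(approx_ey(binom_pmf, 10, 6, 6))     # only 6 of the 10 balls start white
```

Setting w = n = m recovers the approximation for EX, and reducing the number of white balls to be seen lowers only the first term.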